Prediction of hepatic metastasis in esophageal cancer based on machine learning

This study aimed to establish a machine learning (ML) model for predicting hepatic metastasis in esophageal cancer. We retrospectively analyzed patients with esophageal cancer recorded in the Surveillance, Epidemiology, and End Results (SEER) database from 2010 to 2020. We identified 11 indicators associated with the risk of liver metastasis through univariate and multivariate logistic regression. Subsequently, these indicators were incorporated into six ML classifiers to build corresponding predictive models. The performance of these models was evaluated using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, and specificity. A total of 17,800 patients diagnosed with esophageal cancer were included in this study. Age, primary site, histology, tumor grade, T stage, N stage, surgical intervention, radiotherapy, chemotherapy, bone metastasis, and lung metastasis were independent risk factors for hepatic metastasis in esophageal cancer patients. Among the six models developed, the model constructed using the gradient boosting machine (GBM) algorithm exhibited the highest performance during internal validation, with AUC, accuracy, sensitivity, and specificity of 0.885, 0.868, 0.667, and 0.888, respectively. Based on the GBM algorithm, we developed a web-based prediction tool (available at https://project2-dngisws9d7xkygjcvnue8u.streamlit.app/) for predicting the risk of hepatic metastasis in esophageal cancer.


Study population
In this study, we used SEER*Stat 8.4.1 software to download patient data from the SEER database. Patients diagnosed with esophageal cancer (squamous cell carcinoma, SCC, or adenocarcinoma, AC) between 2010 and 2020 were included. Patients were excluded if any of the following was unknown: (1) bone, brain, liver, or lung metastatic status; (2) AJCC T or N stage; (3) race or histologic grade; (4) primary site; (5) histologic type or surgery status; (6) marital status. The study flow chart of case screening is presented in Fig. 1.
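The six exclusion criteria amount to a completeness filter over the downloaded records. The following is a minimal sketch of that filter, not the authors' code: the field names and the set of values treated as "unknown" are hypothetical, and a real SEER*Stat export would need its own column names and codes mapped onto them first.

```python
# Hypothetical field names; real SEER*Stat export columns must be mapped onto these.
REQUIRED_FIELDS = [
    "bone_met", "brain_met", "liver_met", "lung_met",  # criterion 1
    "t_stage", "n_stage",                               # criterion 2
    "race", "grade",                                    # criterion 3
    "primary_site",                                     # criterion 4
    "histologic_type", "surgery",                       # criterion 5
    "marital_status",                                   # criterion 6
]

# Assumed encodings of an unknown value; adjust to the codes in the actual export.
UNKNOWN_VALUES = {"", "unknown"}


def is_known(value):
    """Return True when a field holds a usable (present, not 'unknown') value."""
    return value is not None and str(value).strip().lower() not in UNKNOWN_VALUES


def apply_exclusions(records):
    """Keep only the records in which every required field is known."""
    return [
        record for record in records
        if all(is_known(record.get(field)) for field in REQUIRED_FIELDS)
    ]
```

Applying all criteria in one pass gives the final cohort; counting the records dropped by each criterion separately would reproduce the per-step numbers of a flow chart such as Fig. 1.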

Data selection
In this study, 16 variables related to patient clinicopathology and demographics were selected for analysis. Demographic variables included age, sex, marital status, and race. Clinicopathological variables included primary site, tumor histology, tumor grade, T stage, N stage, surgery, radiation, chemotherapy, brain metastasis, bone metastasis, lung metastasis, and liver metastasis. According to the ICD-O-3 codes, histological types of esophageal cancer were divided into two categories: adenocarcinoma (8140-8573) and squamous cell carcinoma (8050-8082). All esophageal cancer patients were staged according to the AJCC 8th edition guidelines and SEER staging information. In addition, X-tile software was used to calculate the cut-off value for age.
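The histology grouping can be written as a simple range lookup on the ICD-O-3 code. This is an illustrative sketch using the two code ranges given above; the function name is hypothetical, and codes outside both ranges are returned as None rather than assigned to either group.

```python
def histology_group(icdo3_code):
    """Map an ICD-O-3 histology code to one of the study's two categories.

    Returns None for codes outside both ranges stated in the text.
    """
    code = int(icdo3_code)
    if 8050 <= code <= 8082:
        return "squamous cell carcinoma"
    if 8140 <= code <= 8573:
        return "adenocarcinoma"
    return None
```

For example, `histology_group(8070)` falls in the squamous cell carcinoma range and `histology_group("8140")` in the adenocarcinoma range.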

Data pre-processing and feature engineering
All statistical analyses were conducted with Python 3.8 and SPSS 23. We performed logistic regression analysis on the data collected from the SEER database, using SPSS 23, to identify suitable variables for the machine learning models. Variables significantly associated with hepatic metastasis (HM) were first identified by univariate logistic regression analysis (P < 0.05). These variables were then entered into multivariate logistic regression analysis, and those with P < 0.05 were carried forward into the ML models. Correlation analysis was used to examine the correlation among the selected features. Because the data set is imbalanced, an over-sampling method was adopted for data processing [15]. The key of this method is to oversample the
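To illustrate the general idea of over-sampling an imbalanced data set, the sketch below duplicates minority-class rows at random until every class is equally frequent. The text does not specify which over-sampling variant was used, so plain random over-sampling with replacement is an assumption here, and the function name is hypothetical.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equally frequent.

    Assumption: plain random over-sampling with replacement; the variant used
    in the study is not stated.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies from the original rows of this class only.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels
```

Over-sampling is normally applied to the training portion only, so that duplicated rows do not leak into the data used for evaluation.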

Correlation analysis and Importance of features on prediction
Correlation analysis is commonly used to assess the degree of correlation between factors. In this study, we used Spearman correlation analysis to examine the independence between data features. The resulting correlation heat map (Fig. 2A) showed no significant correlation among the 15 features under investigation. Figure 2B presents the importance of the features in each machine learning algorithm. The variables identified through univariate and multivariate logistic regression analysis all contributed to prediction across the six models. Notably, surgery was the most influential feature in the majority of prediction models, underscoring its strong association with hepatic metastasis in esophageal cancer. In most algorithms, T stage, age, primary site, N stage, and tumor grade ranked as the last five, with no significant difference in their contributions to the model. In the GBM model, the features ranked in descending order of importance were lung metastasis, radiation, bone metastasis, histology, chemotherapy, T stage, age, primary site, N stage, and tumor grade.
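Spearman's coefficient is the Pearson correlation computed on ranks, with tied values sharing the mean of their rank positions. The sketch below shows that calculation for one pair of features using only the standard library; the function names are illustrative, and a heat map such as Fig. 2A would repeat the computation for every pair of features.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

The coefficient is undefined when either feature is constant (the denominator is zero), so such features would need to be handled before building the matrix.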

Model performance
The performance of the six predictive models is described in Fig. 3A,B and Table 3. Internal ten-fold cross-validation (Fig. 3A) showed that the GBM model performed best among the six models, with an average AUC of 0.893, followed by the LR model (AUC = 0.882). Results on the internal test set are shown in Table 3 and Fig. 3B. The GBM model also achieved the best AUC (0.885) on the internal test set, with accuracy, sensitivity (recall), and specificity of 0.868, 0.667, and 0.888, respectively. The confusion matrix (Fig. 3C) of the GBM model in the training set and the test set indicated its high accuracy. The probability density plot (Fig. 3D) of the predictive distribution showed that the AUC was highest when the predictive score was 0.38. The clinical impact curve (Fig. 3E) also showed good clinical applicability.
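For reference, the four reported metrics can be computed as in the sketch below: accuracy, sensitivity, and specificity from confusion-matrix counts, and AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case. The function names are illustrative, and any counts or scores passed in are the caller's own; none are taken from the study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall), and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of cases, which is fine for illustration; a rank-based formulation gives the same value more efficiently on a cohort of this size.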

Web predictor
We developed a web predictor based on the GBM model, which exhibited the best predictive performance for hepatic metastasis in patients with esophageal cancer. The primary objective of this web predictor is to provide doctors with a tool for making more precise clinical decisions. By entering the relevant variables associated with hepatic metastasis into the web predictor, healthcare professionals can conveniently estimate the risk of hepatic metastasis in patients with esophageal cancer. The web predictor is available at https://project2-dngisws9d7xkygjcvnue8u.streamlit.app/ (see Fig. 4).

Discussion
Esophageal cancer is a highly fatal malignancy; distant metastases are present in up to 42% of newly diagnosed patients, and the liver is the most frequently involved organ [26][27][28]. Effective treatment and comprehensive management of metastatic esophageal cancer require a multimodal strategy and remain challenging. It is therefore crucial for clinical decision-making to identify high-risk factors in esophageal cancer and to accurately predict whether a patient will develop liver metastasis based on individual clinical and pathological characteristics.
Currently, the HM of advanced esophageal cancer remains understudied. Prognostic research in this domain has two main gaps. First, there are few exploratory investigations into the high-risk prognostic factors associated with esophageal cancer, and the interrelationships among these independent prognostic factors have received little attention. Second, there is a dearth of research on HM models for advanced esophageal cancer that leverage big data. Consequently, comprehensive studies in these areas are needed to improve the understanding and prognostication of advanced esophageal cancer.
Some studies suggest that smoking and drinking are the most common risk factors for esophageal cancer in men [29]. Previous studies [30] have also shown that the degree of tissue differentiation, pathological N stage, vascular invasion, and neural invasion are recognized factors affecting the prognosis of patients with esophageal cancer [31][32][33][34]. However, these studies lacked the support of big data and did not address the prediction of HM in advanced esophageal cancer. Based on big data analysis of the SEER database, our study screened out independent high-risk factors associated with HM by logistic regression analysis. This study included 15 clinically common factors relevant to advanced esophageal cancer with liver metastasis: age, sex, marital status, race, primary site, tumor histology, tumor grade, T stage, N stage, surgery, radiation, chemotherapy, brain metastasis, bone metastasis, and lung metastasis. To assess the independence between features, we generated a correlation heat map by Spearman correlation analysis, which showed no strong correlation among these 15 features (Fig. 2A). Moreover, 11 independent high-risk factors related to liver metastasis were identified by logistic regression analysis: age, primary site, tumor histology, tumor grade, T stage, N stage, surgery, radiation, chemotherapy, bone metastasis, and lung metastasis.
The construction of prediction models for HM of advanced esophageal cancer is as important as the exploration of independent high-risk factors. At present, few studies have focused on risk factors in esophageal cancer patients with distant organ metastases [35]. For instance, Tang et al. constructed a nomogram to predict the survival of patients with metastatic esophageal cancer; however, their study encompassed metastases to all anatomical sites and did not develop a model for predicting the risk of distant metastasis [36]. Similarly, Cheng et al. established models for predicting both the risk and survival of esophageal cancer patients, but these were specific to brain metastasis [37]. Furthermore, Guo et al. described detailed characteristics and explored risk and prognostic factors for patients with liver metastasis, yet they did not develop any predictive tools [38]. Because the liver is the most common site of distant spread, an investigation specifically targeting esophageal cancer patients with liver metastasis is of considerable clinical importance.
Previous studies have constructed nomograms to predict esophageal cancer (EC) metastasis based on traditional logistic models. However, the limitations of this method in prediction accuracy and in processing big data have hindered major breakthroughs in precision medicine [9,10]. Traditional approaches also cannot explore the interactions between different independent high-risk factors [18,19]. In contrast, our ML-based approach can better capture complex associations between different independent high-risk factors, thereby improving the accuracy of the model [20]. Previous studies have used nomograms to predict metastasis in patients with esophageal cancer based on SEER data, but they did not establish an ML-based prediction model for HM of advanced metastatic esophageal cancer [21].
We then constructed six prediction models using ML. Internal ten-fold cross-validation (Fig. 3A) showed that the GBM model performed best among the six models. Based on these findings, we developed an openly accessible online calculator (https://project2-dngisws9d7xkygjcvnue8u.streamlit.app/) built on the GBM model. The model predicts a patient's risk of HM from various clinical indicators and achieved an AUC of 0.885 on the internal test set. Clinicians can enter patient information on the website and obtain the corresponding prediction of hepatic metastasis, thereby facilitating clinical decision-making.
Our research has the following advantages. First, this study established an ML-based statistical model that can predict HM in patients with EC. To the best of our knowledge, we are the first to use ML to construct a prediction model for HM of EC. This model is more reliable than the traditional nomogram prediction model, and this work expands our knowledge of advanced EC. Second, our study further explores the relationships between different independent high-risk factors, which provides a new direction for future clinical research. In other words, clinical research should explore not only metastasis itself but also the correlations between different independent high-risk factors, so as to better understand the relationships among these factors and to address those that adversely affect patients during the perioperative period.
This study also has some limitations. First, current machine learning is almost entirely statistical or black-box, which brings severe theoretical limitations to its performance [23]. Second, this study relied on a single data source (the SEER database) and was validated only internally; applying machine learning models to larger data sets can yield more stable results [22]. Therefore, in subsequent studies, multi-center data can be added for training and external validation to obtain a more reliable prediction model. Third, this study did not include neoadjuvant therapy, surgical methods, circulating tumor DNA, and other factors that may affect the long-term prognosis of patients with esophageal cancer. In the future, as the database continues to improve, we will incorporate more parameters associated with the HM of EC into the web predictor to improve its adaptability.

Conclusion
In summary, this study built machine learning models for predicting liver metastasis of esophageal cancer based on 11 clinicopathological features commonly available in clinical practice, among which the GBM model performed best. The GBM model can be used to predict liver metastasis of esophageal cancer and thereby help clinicians make more accurate treatment plans for patients with esophageal cancer.

Figure 1. The study flow chart of case screening.

Figure 2. (A) Heat map of the correlation of features. (B) Feature importance of different models.

Figure 3. (A) Ten-fold cross-validation results of different machine learning models. (B) The ROC curves of different machine learning models in the internal test set. (C) The confusion matrix of the GBM model in the training set and the internal test set. TP true positive, TN true negative, FP false positive, FN false negative. (D) Probability density plot of the gradient boosting machine model. (E) The clinical impact curve of the gradient boosting machine model.

Table 1. Clinical and pathological characteristics of the training set and internal test set.

Table 2. Univariate analysis and multivariate logistic regression analysis of variables.

Table 3. Prediction performance of different models.